Homework #5

Due 11:59 pm EST, Friday April 1st, 2022.

Email your solutions (both .ipnb and .html files) to: compscbio@gmail.com.

Background:

That same sadistic post-doc in your lab has just read the RNA Velocity paper and is even more excited about it than about Cytotrace. You decide to see how well Velocity pseudotime compares to Cytotrace score derived pseudotime using data for which you can reasonably anchor the starting point of the trajectory inference.

The data

  1. The day 4 mESC data that we keep using over and over again This is the raw counts data, as well as spliced and unspliced counts. As in HW4, we have cleaned it (i.e. removed potential doublets, low quality cells, and mito, ribo and malat genes). In terms of pre-processing, all you have to do is deal with rarely detected genes and normalization.

  2. There is no second data set for this homework.

Your mission:

Run your usual pre-processing, clustering and annotation. Perform Cytotrace analysis. Then run RNA velocity and plot the velocity embeddings on either (1) the 1st and 3rd principal components, or (2) the UMAP embedding (which you need to figure out how to make from the scanpy documentation). Then compute velocity pseudotime. Finally, as you did in HW4, compute diffusion based pseudotime.

Determine how these three pseudotimes compare and produce an appropiate multi-panel figure to display them.

Is there a relationship between velocity confidence and discreptancies between velocity pseudotime and Cytotrace pseudotime and/or diffusion pseudotime?

We will plot everything together at the end

Above we have velocity embeddings on plotted umap in different formats (reg, grid, and stream). This allows us to see the 'flow'/transitions of the cells and the potential states they will be moving to since sc-rna-seq is a snapshot of gene expression of our samples frozen in time.

COMMENTS

Looking at both the 2d and 3d scatter plots, we can see that velocity pdt differs the most from the others. Velo pdt's are look 'reversed' when compared to the other two. In the 2d graphs we can see that velo pdt goes from 1 to 0 as move from left to right while the other 2 go from 0 to 1 as we move from left to right (ie the cells closer to the left of the graph are assigned higher pdts in velo while in ct and dpt, pdt assigned is lower the further left you go).

velo vs dpt violin plots

In these plots we can see that velo pdt's for leidens 0,1,2,3, and 5 are much higher than dpt, while velo's pdt for leiden 4 is shifted down (base of violins looks pretty close though .4-.6). Leiden 6 pdt is slightly higher for velo.

velo vs ct violin plots

For leidens 0,1,2,3,5 and 6, ct pdts are more spread out than velo's. These leidens are also assigned much lower pdt's than in velo. For leiden 4, it almost looks like the violin was flipped across the x-axis as velos pdt's for leiden 4 have a lower base and higher tip while ct is the opposite.

Trust

I think I would trust dpt or ct more than velo because I am pretty sure that dpt root is identified correctly with my function and ct looks more like dpt than velo. Maybe if we invert velo it would be more trustworthy? I've heard that there are some problems with rna velocity computations (Loyal Goff). Dr. Loyal Goff also provided the following paper detailing rna velo and some of its issues: https://www.biorxiv.org/content/10.1101/2022.02.12.480214v1

Bonus

• Bonus 2pts The assumptions of splicing-based velocity estimation do not hold for all genes. Explain in detail how scVelo addresses this. In particular, does the function “scv.tl.velocity” involve any filtering to remove genes with poor fit? If yes, how does it work? If not, does filtering happen anywhere else in the pipeline? For an example of something similar in a different software tool, look up the argument “min.nmat.emat.correlation” in the function “gene.relative.velocity.estimates” from the R package “velocyto.r” (docs here) and its scientific justification in extended data figure 3, panel e in La Manno et al.

min.nmat.emat.correlation: minimum required Spearman rank correlation between n and e counts of a gene

The linked tool requires a minimun spearman rank correlation between spliced and unspliced counts of a gene for it to be included

Looking that scv.tl.velocity docs, it does not seem like the command itself does any filtering. What it does do is 'determine velocities by quantifying how observations deviate from a presumed steady-state equilibrium ratio of unspliced to spliced mRNA levels. This steady-state ratio is obtained by performing a linear regression restricting the input data to the extreme quantiles. By including second-order moments, the stochastic model [Bergen19] exploits not only the balance of unspliced to spliced mRNA levels but also their covariation. By contrast, the likelihood-based dynamical model [Bergen19] solves the full splicing kinetics and generalizes RNA velocity estimation to transient systems. It is also capable of capturing non-observed steady states.' https://scvelo.readthedocs.io/scvelo.tl.velocity/

Although it does 'restrict input data to the extreme quantiles', I believe this is not done when running the command itself as hvg default is true. hvg is calculated prior to running this command. Instead, this is done with the command scvelo.pp.filter_and_normalize. This is the step where our genes are filtered and hvgs are identified. This command runs the following code:

scv.pp.filter_genes(adata)
scv.pp.normalize_per_cell(adata)
if n_top_genes is not None:
    scv.pp.filter_genes_dispersion(adata)
if log:
    scv.pp.log1p(adata)

I believe the command scv.pp.filter_genes is where the bad genes would be filtered out. 'Filter genes based on number of cells or counts. Keep genes that have at least min_counts counts or are expressed in at least min_cells cells or have at most max_counts counts or are expressed in at most max_cells cells. Only provide one of the optional parameters min_counts, min_cells, max_counts, max_cells per call.' https://scvelo.readthedocs.io/scvelo.pp.filter_genes/

But since all of the default parameters in the function filter_genes are none. I think that scv.tl.velocity relies on hvg's in order to calculate velocities. This is done with the command scv.pp.filter_genes_dispersion which is ran when you use scv.pp.filter_and_normalize. This should be all the 'filtering of bad genes' that scv.tl.velocity needs as it 'restricts input data to the extreme quantiles' anyway.

The final preprocessing step before running scv.tl.velocity is scv.pp.moments. This command 'Computes moments for velocity estimation.

First-/second-order moments are computed for each cell across its nearest neighbors, where the neighbor graph is obtained from euclidean distances in PCA space.' Like scv.tl.velocity, use hvg default is true. This means that only hvgs are used. This further cements my belief that scv.tl.velocity relies on hvg's and does not need any additional filtering ontop of what we already do in class (ie, remove ribo, mt, doublets, rarely detected genes, etc).